Efficient Development of Lexical Language Resources and their Representation

نویسندگان

Matej Rojc

Zdravko Kacic

چکیده

Statistical approaches in speech technology, whether used for statistical language models, trees, hidden Markov models or neural networks, represent the driving forces for the creation of language resources (LR), e.g., text corpora, pronunciation and morphology lexicons, and speech databases. This paper presents a system architecture for the rapid construction of morphologic and phonetic lexicons, two of the most important written language resources for the development of ASR (automatic speech recognition) and TTS (text-to-speech) systems. The presented architecture is modular and is particularly suitable for the development of written language resources for inflectional languages. In this paper an implementation is presented for the Slovenian language. The integrated graphic user interface focuses on the morphological and phonetic aspects of language and allows experts to produce good performances during analysis. In multilingual TTS systems, many extensive external written language resources are used, especially in the text processing part. It is very important, therefore, that representation of these resources is time and space efficient. It is also very important that language resources for new languages can be easily incorporated into the system, without modifying the common algorithms developed for multiple languages. In this regard the use of large external language resources (e.g., morphology and phonetic lexicons) represent an important problem because of the required space and slow look-up time. This paper presents a method and its results for compiling large lexicons, using examples for compiling German phonetic and morphology lexicons (CISLEX), and Slovenian phonetic (SIflex) and morphology (SImlex) lexicons, into corresponding finite-state transducers (FSTs). The German lexicons consisted of about 300,000 words, SIflex consisted of about 60,000 and SImlex of about 600,000 words (where 40,000 words were used for representation using finite-state transducers). Representation of large lexicons using finite-state transducers is mainly motivated by considerations of space and time efficiency. A great reduction in size and optimal access time was achieved for all lexicons. The starting size for the German phonetic lexicon was 12.53 MB and 18.49 MB for the morphology lexicon. The starting size for the Slovenian phonetic lexicon was 1.8 MB and 1.4 MB for the morphology lexicon. The final size of the corresponding FSTs was 2.78 MB for the German phonetic lexicon, 6.33 MB for the German morphology lexicon, 253 KB for SIflex and 662 KB for the SImlex lexicon. The achieved look-up time is optimal, since it only depends on the length of the input word and not on the size of the lexicon. Integration of lexicons for new languages into the multilingual TTS system is easy when using such representations and does not require any changes in the algorithms used for such lexicons.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

language development and lexical awareness of bilingual (Azeri -Persian) hard of hearing impaired children

The Relationship between Mean Length of utterance (MLU), Lexical Richness and syntactical and lexical metalinguistic Awareness in Bilingual (Turkish-Persian) normal and hearing impaired Children Objectives: Regarding the impact of hearing loss on language development and metalinguistic skill and being language development different from metalinguistic skill in bilingual children, studying of...

متن کامل

The Representation of Iran’s Nuclear Program in British Newspaper Editorials: A Critical Discourse Analytic Perspective

In this study, Van Dijk’s (1998) model of CDA was utilized in order to examine the representation of Iran’s nuclear program in editorials published by British news casting companies. The analysis of the editorials was carried out at two levels of headlines and full text stories with regard to the linguistic features of lexical choices, nominalization, passivization, overcompleteness, and voice....

متن کامل

Word Type Effects on L2 Word Retrieval and Learning: Homonym versus Synonym Vocabulary Instruction

The purpose of this study was twofold: (a) to assess the retention of two word types (synonyms and homonyms) in the short term memory, and (b) to investigate the effect of these word types on word learning by asking learners to learn their Persian meanings. A total of 73 Iranian language learners studying English translation participated in the study. For the first purpose, 36 freshmen from an ...

متن کامل

Mental Representation of Cognates/Noncognates in Persian-Speaking EFL Learners

The purpose of this study was to investigate the mental representation of cognate and noncognate translation pairs in languages with different scripts to test the prediction of dual lexicon model (Gollan, Forster, & Frost, 1997). Two groups of Persian-speaking English language learners were tested on cognate and noncognate translation pairs in Persian-English and English-Persian directions with...

متن کامل

Early Phonological and Lexical Development of a Farsi Speaking Child: A Longitudinal Case Study

The present study aims at the description and analysis of the phonological and lexical development of a child who is acquiring Farsi as his first language. The child's language production at the holophrastic stage of language development, mainly single words, is observed and recorded longitudinally for nearly seven months since he was 16 months old until he turned 23 months. An attempt is mad...

متن کامل

Comparative Study of Degree of Bilingualism in Lexical Retrieval and Language Learning Strategies

This study compares lexical retrieval amongst monolinguals and intermediate bilinguals and advanced bilinguals. It also investigates the possible effects of their language learning strategies on their respective lexical retrieval advantage. The study used a mixed methods design and the groups consisted of 20 Persian near-monolinguals, 20 Persian-English intermediate level bilinguals, and 20 Per...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

I. J. Speech Technology

دوره 6 شماره

صفحات -

تاریخ انتشار 2003

Efficient Development of Lexical Language Resources and their Representation

نویسندگان

چکیده

منابع مشابه

language development and lexical awareness of bilingual (Azeri -Persian) hard of hearing impaired children

The Representation of Iran’s Nuclear Program in British Newspaper Editorials: A Critical Discourse Analytic Perspective

Word Type Effects on L2 Word Retrieval and Learning: Homonym versus Synonym Vocabulary Instruction

Mental Representation of Cognates/Noncognates in Persian-Speaking EFL Learners

Early Phonological and Lexical Development of a Farsi Speaking Child: A Longitudinal Case Study

Comparative Study of Degree of Bilingualism in Lexical Retrieval and Language Learning Strategies

عنوان ژورنال:

اشتراک گذاری